feat(cloud): Hetzner control plane IaC + data plane naming + legacy milady-core deprecation #7890
…ilady-core deprecation

Four coordinated pieces:

1. Terraform module `packages/cloud-infra/cloud/terraform/hetzner/control-plane/` declares the persistent VM(s) that host the orchestrator daemon (provisioning-worker, agent-router, headscale, cloudflared). Uses hetznercloud/hcloud + cloudflare providers, Cloudflare R2 as S3 state backend. Includes a cloud-init bootstrap template, tfvars examples, and a README walkthrough for both new-host bootstrap and `terraform import` of the existing prod VM (89.167.63.246) into state.

2. Data-plane naming: `node-<hex>` becomes `eliza-core-<hex>` going forward. `generateNodeId()` now sources entropy from `crypto.getRandomValues()` instead of `Math.random().toString(16)`, which silently strips trailing zeros and could produce short or colliding suffixes. That matters because `node_id` is UNIQUE in `docker_nodes`.

3. Data-plane location default fixed: Hetzner deprecated cpx32 on `ash` (Ashburn), so the previous `defaultHcloudLocation = "ash"` default fails with "unsupported location for server type". Flipped to `fsn1` to match the actual prod fleet.

4. Migration 0132 disables the 6 legacy `milady-core-*` rows (`enabled=false`, `capacity=8`). They were inserted by hand in 2026-03 with `capacity=100` (unrealistic for cpx32), have been offline according to health checks for weeks, and are now ignored by the autoscaler. Existing sandboxes keep running on the underlying Docker daemons until their next user-triggered restart, at which point the daemon provisions a replacement on a fresh autoscaled core. Ops follow-up (delete Hetzner servers + DB rows) is documented in the architecture markdown.

ARCHITECTURE.md formalises the two-tier model (static control plane vs elastic data plane) so future ops actions have a clear runbook.

Followups (separate PRs): Terraform modules for headscale state + the cloudflared tunnel; terraform-apply GitHub workflow; repatriating the 4 remaining cron paths off the orphan container-control-plane service onto the daemon-queue pattern; raising the Hetzner Cloud server-count limit.

Tests:
- 2 new sociable tests for generateNodeId() asserting the prefix + exactly 8 lowercase hex chars + uniqueness across 50 calls. All 5 node-autoscaler tests pass.

Out-of-band ops actions needed before merging to production:
- Generate R2 API token + create the bucket entry (already done: eliza-terraform-state in WEUR)
- Set environment secrets used by the daemon: HETZNER_CLOUD_API_KEY, CONTAINERS_AUTOSCALE_PUBLIC_SSH_KEY (already done on staging VM)
- Open Hetzner ticket to raise server-count limit past 10 so autoscale can actually create replacement cores
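The entropy change described in the data-plane naming item can be sketched as follows. This is a minimal illustration rather than the PR's actual code: the function body, the `NODE_ID_PREFIX` constant, and the `legacySuffix` helper are assumptions, and `webcrypto` is imported from `node:crypto` only so the sketch runs on Node as well as Bun.

```typescript
import { webcrypto } from "node:crypto";

// Assumed constant name; the prefix itself comes from the PR description.
const NODE_ID_PREFIX = "eliza-core-";

// Sketch of the new behaviour: 4 random bytes, each rendered as exactly two
// hex digits, so the suffix is always 8 lowercase hex characters.
export function generateNodeId(): string {
  const bytes = new Uint8Array(4);
  webcrypto.getRandomValues(bytes);
  let suffix = "";
  for (const byte of bytes) {
    suffix += byte.toString(16).padStart(2, "0");
  }
  return `${NODE_ID_PREFIX}${suffix}`;
}

// Hypothetical helper reproducing the old expression on a fixed input.
// toString(16) emits only as many fractional digits as the value needs,
// so "round" values yield suffixes shorter than 8 characters.
export function legacySuffix(randomValue: number): string {
  return randomValue.toString(16).slice(2, 10);
}
```

For instance, `legacySuffix(0.5)` evaluates `"0.8".slice(2, 10)` and returns the one-character string `"8"`, which is the truncation the commit message refers to. The byte-wise `padStart` avoids it because every byte contributes two digits regardless of its value.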
    users:
      - name: deploy
        groups: sudo, docker
        shell: /bin/bash
        sudo: ALL=(ALL) NOPASSWD:ALL
        lock_passwd: true
deploy user has no SSH authorized keys — GitHub Actions deploy workflow will be unable to connect
Hetzner's SSH key injection only populates root's ~/.ssh/authorized_keys. The deploy user is created with lock_passwd: true and no ssh_authorized_keys entry, making it unreachable via SSH. The README's deploy step triggers deploy-eliza-provisioning-worker.yml which presumably SSHes into this user — that will fail until keys are injected out-of-band.
    - curl -fsSL https://get.docker.com | sh
    - systemctl enable --now docker

    # Bun runtime for the deploy user (the daemons run under bun/tsx).
    - su - deploy -c 'curl -fsSL https://bun.sh/install | bash -s "bun-v1.3.13"'
Unverified `curl | sh` installs on the control-plane VM
Both Docker and Bun installs pipe remote scripts into a shell without checksum verification. This VM holds DATABASE_URL, HCLOUD_TOKEN, Headscale state, and the cloudflared tunnel — a higher-value target than a data-plane node. A supply-chain or MITM attack at bootstrap time would silently compromise the entire control plane.
    }

    resource "hcloud_ssh_key" "operators" {
      for_each = { for idx, key in var.ssh_public_keys : idx => key }
Positional list indexing causes unnecessary key churn on reorder/insert
`{ for idx, key in var.ssh_public_keys : idx => key }` maps list position to the Hetzner SSH key resource address. Inserting a key before the last position shifts every subsequent key's `each.key`, causing Terraform to plan renames or destroy+recreates of downstream SSH key objects.
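The churn described in this comment can be modelled outside Terraform. The sketch below is a TypeScript illustration, not Terraform code: `keyByIndex`, `keyByDigest`, and `churn` are hypothetical helpers, the 12-character digest prefix is an arbitrary assumption, and the key strings are placeholders. It compares position-based addresses with content-derived ones when a key is inserted at the front of the list.

```typescript
import { createHash } from "node:crypto";

type AddressMap = Record<string, string>;

// Address each key by its list position (the pattern the review flags).
export function keyByIndex(keys: string[]): AddressMap {
  const out: AddressMap = {};
  keys.forEach((key, idx) => {
    out[String(idx)] = key;
  });
  return out;
}

// Address each key by a short prefix of its SHA-256 digest, so the address
// depends only on the key's content, not on where it sits in the list.
export function keyByDigest(keys: string[], prefixLength = 12): AddressMap {
  const out: AddressMap = {};
  for (const key of keys) {
    const digest = createHash("sha256").update(key).digest("hex");
    out[digest.slice(0, prefixLength)] = key;
  }
  return out;
}

// Count existing addresses whose value changed or disappeared.
export function churn(before: AddressMap, after: AddressMap): number {
  return Object.keys(before).filter((addr) => after[addr] !== before[addr]).length;
}

// Placeholder key material, for illustration only.
const original = ["ssh-ed25519 AAAA-alice", "ssh-ed25519 AAAA-bob"];
const withNewFirst = ["ssh-ed25519 AAAA-carol", ...original];

export const indexChurn = churn(keyByIndex(original), keyByIndex(withNewFirst));
export const digestChurn = churn(keyByDigest(original), keyByDigest(withNewFirst));
```

With these inputs `indexChurn` is 2, because both existing addresses now point at a different key, while `digestChurn` is 0. The trade-off of content-derived addresses is that they are less readable in plan output than small integers.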
Three issues raised by Greptile on the initial commit:
P1 deploy user had no SSH authorized_keys, so the auto-deploy
workflow (which SSHes as `deploy`, not root) would fail until
keys were copied out-of-band. cloud-init now expands the same
operator key list into the deploy user via a Terraform-template
loop, so the user is reachable from first boot.
P2 (sec) Replaced `curl get.docker.com | sh` with the official
Docker apt repo + GPG-verified keyring (cloud-init handles the
keyring). Replaced `curl bun.sh/install | bash` with a pinned
GitHub release download whose SHA-256 is verified against the
same release's SHASUMS256.txt before extracting.
P2 Keyed hcloud_ssh_key.operators by sha256(key) prefix instead of
list index, so inserting an operator at the start of
var.ssh_public_keys no longer cascades into renames/recreates
of every subsequent SSH key resource.
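The checksum step mentioned in the second item can be sketched as below. This is an illustration under assumptions, not the bootstrap script from the PR: it assumes the manifest uses the common `<hex digest>  <filename>` layout, and `parseShasums` and `verifyArtifact` are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

// Parse a SHASUMS-style manifest with one "<hex digest>  <filename>" entry
// per line. Lines that do not match that shape are ignored.
export function parseShasums(manifest: string): Map<string, string> {
  const digests = new Map<string, string>();
  for (const line of manifest.split("\n")) {
    const match = /^([0-9a-fA-F]{64})\s+\*?(.+)$/.exec(line.trim());
    if (match) {
      digests.set(match[2], match[1].toLowerCase());
    }
  }
  return digests;
}

// Returns true only when the artifact's SHA-256 equals the manifest entry.
export function verifyArtifact(name: string, data: Uint8Array, manifest: string): boolean {
  const expected = parseShasums(manifest).get(name);
  if (expected === undefined) {
    return false; // file not listed: refuse rather than trust
  }
  const actual = createHash("sha256").update(data).digest("hex");
  return actual === expected;
}
```

A caller would download both the archive and the manifest for the pinned release, call `verifyArtifact`, and extract only on `true`. Note that a manifest fetched from the same origin as the archive mainly guards against corruption and partial tampering; pinning the expected digest in the repository would be a stricter variant.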
…za-<n>
The shorter prefix matches the data-plane convention (eliza-core-<hex>)
and supports the in-place rename of the legacy prod VM (milady → eliza-1)
via Hetzner's PUT /servers/{id}. Environment moves to a label
(`environment = production|staging`) so the Hetzner Console can filter
without bloating every SSH command.
Also drops the inline import walkthrough from the README — it's a
one-shot adoption op that lives in operator scratch space, not in repo
docs that drift over time.
Summary
Four coordinated changes that move the Hetzner setup from "manually-poked VMs" to a proper two-tier architecture:

1. Terraform IaC for the control plane: declares the persistent VM(s) that host the orchestrator daemon (`provisioning-worker`, `agent-router`, `headscale`, `cloudflared`). Uses `hetznercloud/hcloud` + `cloudflare` providers, Cloudflare R2 as S3 state backend. Includes cloud-init bootstrap, tfvars examples, and a README walkthrough for `terraform import` of the existing prod VM (89.167.63.246).
2. Data-plane naming: `node-<hex>` becomes `eliza-core-<hex>` going forward. `generateNodeId()` now sources entropy from `crypto.getRandomValues()` instead of `Math.random().toString(16).slice(2, 10)`; the latter silently strips trailing zeros and could produce short or colliding suffixes when `node_id` is UNIQUE in `docker_nodes`.
3. Data-plane location default fixed: Hetzner deprecated `cpx32` on `ash` (Ashburn), so `defaultHcloudLocation = "ash"` failed with "unsupported location for server type". Flipped to `fsn1` to match the actual prod fleet.
4. Migration 0132: disables the 6 legacy `milady-core-*` rows (`enabled=false`, `capacity=8`). They were inserted by hand in 2026-03 with `capacity=100` (unrealistic for cpx32), have been health-check offline for weeks, and are now ignored by the autoscaler. Existing sandboxes keep running on the underlying Docker daemons until their next user-triggered restart, at which point the daemon provisions a replacement on a fresh autoscaled core. Ops follow-up (delete Hetzner servers + DB rows) is documented in `ARCHITECTURE.md`.

`packages/cloud-infra/cloud/terraform/hetzner/ARCHITECTURE.md` formalises the two-tier model (static control plane vs elastic data plane) so future ops actions have a clear runbook.

What this PR does NOT do (followups)

- `terraform-apply` GitHub workflow
- Repatriating the 4 remaining cron paths (`pool-replenish`, `pool-health-check`, `pool-image-rollout`, `deployment-monitor`) off the orphan `container-control-plane` service onto the daemon-queue pattern, then retiring the service entirely

Test plan

- `bun test packages/cloud-shared/src/lib/services/containers/node-autoscaler.test.ts`: 5/5 pass (3 existing + 2 new for the `generateNodeId()` rename and entropy fix)
- `bunx tsc --noEmit` on `packages/cloud-shared`: clean (pre-existing core/shared noise unrelated)
- `terraform init -backend=false && terraform validate` on `packages/cloud-infra/cloud/terraform/hetzner/control-plane/`: success
- `terraform fmt -recursive -check`: clean
- `eliza-terraform-state` exists in WEUR (verified)
- `eliza-core-<hex>` provisions via autoscale and `milady-core-*` sandboxes drain naturally on restart

Ran the `/clean` skill on this PR

- … `modules/manager-vm/` wrapper, removed a dead-exported helper (-87 LOC)
- … `ash` vs `fsn1` inconsistency: fixed in this PR
- … `Math.random()` slicing bug, replaced with `crypto.getRandomValues`
- … `generateNodeId()`

Out-of-band ops actions needed for production rollout

- … (`eliza-terraform-state` exists)
- … `HETZNER_CLOUD_API_KEY` + `CONTAINERS_AUTOSCALE_PUBLIC_SSH_KEY` on the daemon (already done on staging VM at 89.167.63.246)

Greptile Summary
This PR transitions the Hetzner setup to a two-tier architecture with Terraform IaC for the static control plane, fixes the `defaultHcloudLocation` fallback from the deprecated `ash` to `fsn1`, replaces `Math.random()`-based node ID generation with `crypto.getRandomValues()`, renames node IDs from `node-<hex>` to `eliza-core-<hex>`, and disables the six legacy `milady-core-*` DB rows via migration 0132.

- … `hetzner/control-plane/` module declares Hetzner VMs + Cloudflare DNS records backed by Cloudflare R2 state; cloud-init template handles first-boot setup of Docker, Bun, and the deploy user.
- `generateNodeId()` fix: 4 bytes from `crypto.getRandomValues()` hex-encoded with `padStart` guarantees exactly 8 hex characters, preventing the trailing-zero truncation bug in the previous `Math.random().toString(16).slice(2,10)` path.
- … `milady-core-*` rows to `enabled=false, capacity=8` so the autoscaler ignores them while live sandboxes drain naturally on their next restart.

Confidence Score: 3/5
The TypeScript and migration changes are safe to merge, but the Terraform module has two gaps that would prevent a usable deployment: no guard against an empty SSH key list (produces a permanently inaccessible VM) and no authorized_keys injection for the deploy user (blocks the GitHub Actions deploy workflow).
The node-autoscaler entropy fix and the milady-core migration are clean and well-tested. The defaultHcloudLocation fix is a straightforward one-liner. The risk sits entirely in the new Terraform module: applying with the default empty ssh_public_keys creates a VM nobody can access, and the deploy user created by cloud-init has no SSH authorized keys so the expected deploy workflow cannot SSH in.
variables.tf (missing SSH key validation) and cloud-init/bootstrap.yaml.tftpl (missing ssh_authorized_keys for the deploy user) need attention before the module is run against any environment.
Security Review
- … (`bootstrap.yaml.tftpl` lines 49–53): both Docker (`curl -fsSL https://get.docker.com | sh`) and Bun (`curl -fsSL https://bun.sh/install | bash`) are fetched and executed without checksum verification. The control-plane VM holds `DATABASE_URL`, `HCLOUD_TOKEN`, Headscale state, and the cloudflared tunnel; a supply-chain or MITM attack at bootstrap time would silently compromise the entire control plane.

Important Files Changed
Comments Outside Diff (1)
packages/cloud-infra/cloud/terraform/hetzner/control-plane/variables.tf, line 536-540

… `terraform apply`

`var.ssh_public_keys` defaults to `[]`, and there is no `validation` block requiring at least one entry. When Hetzner creates a server, it injects the listed SSH public keys into root's `~/.ssh/authorized_keys`. With an empty list, the VM boots with no authorized key for root, and the `deploy` user also has no keys (see the cloud-init template). The only recovery is Hetzner rescue mode.

Reviews (1): Last reviewed commit: "feat(cloud): Hetzner control plane IaC +..."